Based on the brief, we would like these columns in our dataframe for modelling:
auckland_cycle_2018 <- read.csv("data/aucklandcyclists2018.csv")
auckland_cycle_2019 <- read.csv("data/aucklandcyclists2019.csv")
auckland_cycle_2020 <- read.csv("data/aucklandcyclists2020.csv")
auckland_cycle_2021 <- read.csv("data/aucklandcyclists2021.csv")
auckland_cycle_2022 <- read.csv("data/aucklandcyclists2022.csv")
monitoring_locations <- read.csv("data/monitoringlocations.csv")
daily_rainfall <- read.csv("data/daily-rainfall-at-30-sites-1960-to-2022.csv")
daily_rainfall_data_dict <- read.csv("data/daily-rainfall-at-30-sites-1960-to-2022-dictionary.csv")
We need all date formats to be in YYYY-MM-DD, and as date objects.
auckland_cycle_2018 <- auckland_cycle_2018 %>%
mutate(date = as.Date(Date, format = "%Y-%m-%d")) %>%
dplyr::select(-Date)
auckland_cycle_2019 <- auckland_cycle_2019 %>%
mutate(date = dmy(str_remove(Date, "^[A-Za-z]+,\\s*"))) %>%
mutate(date = as.Date(date, format = "%d-%m-%Y")) %>%
dplyr::select(-Date)
auckland_cycle_2020 <- auckland_cycle_2020 %>%
mutate(date = dmy(str_remove(Date, "^[A-Za-z]+\\s+"))) %>%
mutate(date = as.Date(date, format = "%Y-%m-%d")) %>%
dplyr::select(-Date)
auckland_cycle_2021 <- auckland_cycle_2021 %>%
mutate(date = as.Date(Date, format = "%Y-%m-%d")) %>%
dplyr::select(-Date)
auckland_cycle_2022 <- auckland_cycle_2022 %>%
mutate(date = as.Date(Date, format = "%Y-%m-%d")) %>%
dplyr::select(-Date)
auckland_cycle_2018 <- auckland_cycle_2018 %>%
pivot_longer(cols = -date, names_to = "cycle_recording_site", values_to = "cycle_count") %>%
drop_na(cycle_count) %>%
mutate(cycle_count = round(cycle_count))
auckland_cycle_2019 <- auckland_cycle_2019 %>%
pivot_longer(cols = -date, names_to = "cycle_recording_site", values_to = "cycle_count") %>%
drop_na(cycle_count)
auckland_cycle_2020 <- auckland_cycle_2020 %>%
pivot_longer(cols = -date, names_to = "cycle_recording_site", values_to = "cycle_count") %>%
drop_na(cycle_count)
auckland_cycle_2021 <- auckland_cycle_2021 %>%
pivot_longer(cols = -date, names_to = "cycle_recording_site", values_to = "cycle_count") %>%
drop_na(cycle_count)
auckland_cycle_2022 <- auckland_cycle_2022 %>%
pivot_longer(cols = -date, names_to = "cycle_recording_site", values_to = "cycle_count") %>%
drop_na(cycle_count)
auckland_cycle <- rbind(auckland_cycle_2018, auckland_cycle_2019, auckland_cycle_2020, auckland_cycle_2021, auckland_cycle_2022)
auckland_cycle <- auckland_cycle %>%
mutate(cycle_count = round(cycle_count)) %>%
mutate(cycle_recording_site = as.factor(cycle_recording_site))
We drop all NAs. It would be difficult to impute, and we already have quite a bit of data. Though the missing data could have an effect on biasing our estimates.
I’m unsure why we have non-integer counts for cycle counts so I rounded all cycle counds.
Two of the rainfall monitoring locations are relevant to Auckland: “Auckland (Auckland)”, close to the airport, and “Whangaparāoa (Auckland)”.
We only need need rainfall data corresponding to cycling data between 2018-01-01 and 2022-12-31.
daily_rainfall_auckland <- daily_rainfall %>%
filter(site == "Auckland (Auckland)") %>%
mutate(date = as.Date(date, format = "%Y-%m-%d")) %>%
filter(date >= as.Date("2018-01-01"))
daily_rainfall_whangaparaoa <- daily_rainfall %>%
filter(site == "Whangaparāoa (Auckland)") %>%
mutate(date = as.Date(date, format = "%Y-%m-%d")) %>%
filter(date >= as.Date("2018-01-01"))
auckland_rainfall_station_coordinates <- data_frame(
lat = -37.00813,
lon = 174.7887
)
whangaparaoa_rainfall_station_coordinates <- data_frame(
lat = -36.606,
lon = 174.835
)
rainfall_stations_to_cycle_recording_sites <- data_frame(cycle_recording_site = monitoring_locations$Location)
rainfall_stations_to_cycle_recording_sites <- rainfall_stations_to_cycle_recording_sites %>%
mutate(distance_to_auckland = pmap_dbl(
list(lat = monitoring_locations$Latitude, lon = monitoring_locations$Longitude),
~ distHaversine(c(..2, ..1), c(auckland_rainfall_station_coordinates$lon, auckland_rainfall_station_coordinates$lat))
)) %>%
mutate(distance_to_whangaparaoa = pmap_dbl(
list(lat = monitoring_locations$Latitude, lon = monitoring_locations$Longitude),
~ distHaversine(c(..2, ..1), c(whangaparaoa_rainfall_station_coordinates$lon, whangaparaoa_rainfall_station_coordinates$lat))
))
auckland_cycle <- auckland_cycle %>%
left_join(rainfall_stations_to_cycle_recording_sites, by = "cycle_recording_site") %>%
left_join(daily_rainfall_auckland %>% dplyr::select(date, rainfall) %>% dplyr::rename(rainfall_auckland = rainfall), by = "date") %>%
left_join(daily_rainfall_whangaparaoa %>% dplyr::select(date, rainfall) %>% dplyr::rename(rainfall_whangaparaoa = rainfall), by = "date") %>%
mutate(weighted_average_rainfall = (rainfall_auckland / distance_to_auckland + rainfall_whangaparaoa / distance_to_whangaparaoa) /
(1 / distance_to_auckland + 1 / distance_to_whangaparaoa)) %>%
dplyr::select(-c(distance_to_auckland, distance_to_whangaparaoa, rainfall_auckland, rainfall_whangaparaoa))
Another approach would be to find the closest rainfall station and use those measurements for that day at that cycle recording site. However, I believe using the weighted average between the two rainfall stations in Auckland weighted by their relative distances to the cycle recording sites is the best approach as it balances out bias created from having rainfall measurements taken at a vastly different location to the cycle recording sites.
auckland_cycle <- auckland_cycle %>%
mutate(day_of_week = as.factor(weekdays(date))) %>%
mutate(day_of_week = factor(day_of_week,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))) %>%
mutate(month = factor(month(date), labels = month.name)) %>%
mutate(year = as.factor(year(date)))
These covariates should be sufficient to account for seasonality (in addition to covid_status).
auckland_cycle <- auckland_cycle %>%
mutate(covid_status = case_when(
date >= as.Date("2018-01-01") & date <= as.Date("2020-03-24") ~ "pre-covid",
date >= as.Date("2020-03-25") & date <= as.Date("2020-04-27") ~ "level-4-lockdown",
date >= as.Date("2020-04-28") & date <= as.Date("2020-05-10") ~ "level-3-lockdown",
date >= as.Date("2020-05-11") & date <= as.Date("2020-05-24") ~ "level-2-lockdown",
date >= as.Date("2020-05-25") & date <= as.Date("2020-06-07") ~ "level-1-lockdown",
date >= as.Date("2020-06-08") & date <= as.Date("2020-08-10") ~ "inter-lockdown",
date >= as.Date("2020-08-11") & date <= as.Date("2020-08-29") ~ "level-3-lockdown",
date >= as.Date("2020-08-30") & date <= as.Date("2020-09-22") ~ "level-2.5-lockdown",
date >= as.Date("2020-09-23") & date <= as.Date("2020-10-06") ~ "level-2-lockdown",
date >= as.Date("2020-10-07") & date <= as.Date("2021-02-13") ~ "inter-lockdown",
date >= as.Date("2021-02-14") & date <= as.Date("2021-03-06") ~ "level-3-lockdown",
date >= as.Date("2021-03-07") & date <= as.Date("2021-08-16") ~ "inter-lockdown",
date >= as.Date("2021-08-17") & date <= as.Date("2021-09-20") ~ "level-4-lockdown",
date >= as.Date("2021-09-21") & date <= as.Date("2021-12-01") ~ "level-3-lockdown",
date >= as.Date("2021-12-02") & date <= as.Date("2021-12-29") ~ "traffic-red-lockdown",
date >= as.Date("2021-12-30") & date <= as.Date("2022-09-11") ~ "traffic-orange-lockdown",
date >= as.Date("2022-09-12") & date <= as.Date("2022-12-31") ~ "post-covid",
)) %>%
mutate(covid_status = as.factor(covid_status))
I obtained lockdown and restriction dates from policycommons.ac.nz.
I believe there aren’t any relevant colliders or mediators for rainfall on cycle count.
These plots show direct relationships between the concerning covariate with direct effects on cycle_count and confounders of rainfall, supporting our causal diagram.
There was noticeably more cycle counts in 2019. There appears to be an overall downward trend on cycling counts across the years. People often cycle for leisure or transport (usually to and from work). Since COVID-19, there has been a rise of hybrid/remote work, which in combination with lockdowns, could explain the downward trend in cycle counts. We include year in our model. We include year in our model as this should have direct effects on cycle counts and a confounder of rainfall on cycle counts.
There appears to be seasonality - the colder months (June, July, and August) appear to have less cycle total counts compared to warmer months (December, January, February, March). There are unusually low total cycle counts in December, this could be due to the Christmas Holidays. We include month in our model as this should have direct effects on cycle counts and a confounder of rainfall on cycle counts.
It would be interesting to see the trend and seasonality across all dates. Although dates will not be used in our model.
The plot of total cycle counts vs date supports the seasonality seen between month and total cycle count, and downward trend seen between year and total cycle count.
However these trends and seasonality are disrupted during the COVID-19 lockdowns, hence why we need to adjust for the COVID status covariate when fitting our model.
There appears to be seasonality - the middle of the weekday (Tuesday and Wednesday) has the highest total cycle counts, whereas the weekend (Saturday and Sunday) have the lowest total cycle counts. This supports the assumption that people often cycle for leisure or transport (usually to and from work), the middle of the weekday being when people are most likely working. We include day of week in our model as it has direct effects on cycle count.
I averaged the cycle counts across all dates and locations within a given COVID status, this is because each status has different periods. Note that averages can be biased due to other seasonal and trend effects.
There appears to be a higher average cycle count before the COVID-19 pandemic, which decreases significantly starting from the first lockdown in March 2020. The “post-COVID” period also shows a noticeably lower average cycle count. The primary focus of this analysis is to model the relationship between rainfall and cycle counts. Although predicting and explaining cyclist behavior during the COVID-19 period is challenging, we adjust for COVID status in our model to reduce noise as it directly impacts cycle counts. However, our primary attention is not on the inference of the COVID status covariate itself but rather on understanding the relationship between rainfall and cycle counts.
There is non-uniformity between total cycle count vs cycle recording site. Certain cycle recording sites such as NW Cycleway Kingslandm, Quay St. Eco Display Classic, and Quay St. Spark Arena have more cycle counts across the dataset compared to other cycle recording sites such as Matakana. There seems be an effect of rural and urbanised recording sites on total cycle count, with urbanised sites being more popular. We include cycle recording site in our model as it has direct effects on cycle count.
There appears to be some seasonality of year on rainfall, this could be due to El Niño-Southern Oscillation (ENSO). We include year in our model as a confounder for rainfall on cycle count, and don’t need to include ENSO index as year covariate should cover the effects of ENSO.
Note: The bands are caused by taking rainfall measurements on the same day but at different sites with similar amounts of weighted average rainfall.
There appears to be seasonality of month on rainfall - winter months (June, July, and August) having the highest rainfall compared to other months. January, a Summer month, has the lowest rainfall compared to other months. We include month in as a confounder for rainfall on cycle count.
There is uniformity between rainfall vs cycle recording site. However some sites (GI to Tamaki Dr. Section 2, Matakana, New Lynn to Avondale SUP, and Quay St. Totem UZELT) appear to have much lower rainfalls than all other sites, this is possibly due to the missing values in the data.
The uniformity could be due to not having local rainfall data which accounts for micro-climates (i.e. Matakana and Karangahape Rd. should have different micro-climates). We are weighting the average between measurements from only two rainfall stations based on distance, hence why we would expect there to be less variation between rainfall across all cycle recording sites.
Despite the limitation with our data, we still stand to include cycle recording site in our causal diagram as a confounder for rainfall on cycle count. Cycle recording site is still included in our model, however it’s estimates probably tell us more about the direct effects of cycle recording site on cycle count.
A clear negative trend between rainfall and cycle count can be seen across all sites and at each site. Higher rainfall measurements results in less cycle counts, inversely lower rainfall measurements results in more cycle counts.
There seems to be an inflated amount of 0 mm rainfall recordings.
Since we are dealing with counts, we use a Poisson model as a first step, with first order covariates and no interactions.
cycle_count_poisson_model <- glm(cycle_count ~ weighted_average_rainfall + day_of_week + month + year + cycle_recording_site + covid_status - 1, family = poisson(link = "log"), data = auckland_cycle)
cycle_count_poisson_model_summary
##
## Call:
## glm(formula = cycle_count ~ weighted_average_rainfall + day_of_week +
## month + year + cycle_recording_site + covid_status - 1, family = poisson(link = "log"),
## data = auckland_cycle)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -41.262 -4.572 -0.732 3.234 193.406
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## weighted_average_rainfall -2.525e-02 4.573e-05 -552.16 <2e-16 ***
## year2019 4.189e-02 6.641e-04 63.08 <2e-16 ***
## year2020 8.752e-02 1.049e-03 83.40 <2e-16 ***
## year2021 1.051e-01 1.387e-03 75.77 <2e-16 ***
## year2022 3.476e-01 8.077e-03 43.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 239480046 on 79449 degrees of freedom
## Residual deviance: 4361284 on 79366 degrees of freedom
## AIC: 4907324
##
## Number of Fisher Scoring iterations: 5
The residual deviance appears a lot larger than the degrees of freedom, this suggests over-dispersion. The next step would be to fit a negative binomial model with a dispersion parameter.
The average effects of weighted average rainfall are: for every one-unit (1 mm) increase of rainfall, the expected count of cyclists decreases by a factor of \(e^{-0.02525} \approx 0.975\), or about \(2.5\%\) less.
The fit line looks quite noisy, however we can see that the line follows the trend quite closely with cycle counts decreasing with increasing rainfall measurement.
The variance of the scatter is not constant, it increases as the fitted values increase, suggesting that there is over-dispersion. A solution is use a negative binomial model.
cycle_count_negbin_model <- glm.nb(cycle_count ~ weighted_average_rainfall + day_of_week + month + year + cycle_recording_site + covid_status - 1, data = auckland_cycle)
cycle_count_negbin_model_summary
##
## Call:
## glm.nb(formula = cycle_count ~ weighted_average_rainfall + day_of_week +
## month + year + cycle_recording_site + covid_status - 1, data = auckland_cycle,
## init.theta = 4.069766297, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.3553 -0.7794 -0.1173 0.4658 12.3624
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## weighted_average_rainfall -0.0209893 0.0002997 -70.031 < 2e-16 ***
## year2019 0.0339994 0.0059227 5.740 9.44e-09 ***
## year2020 0.1127775 0.0100965 11.170 < 2e-16 ***
## year2021 0.1318295 0.0126806 10.396 < 2e-16 ***
## year2022 0.2689079 0.0567426 4.739 2.15e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(4.0698) family taken to be 1)
##
## Null deviance: 72246831 on 79449 degrees of freedom
## Residual deviance: 83839 on 79366 degrees of freedom
## AIC: 926037
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 4.0698
## Std. Err.: 0.0208
##
## 2 x log-likelihood: -925868.5680
The coefficient for weighted_average_rainfall has increased slightly.
The residual deviance appears a lot smaller now. The dispersion parameter has been found to be 4.1.
The average effects of weighted average rainfall are: for every one-unit (1 mm) increase of rainfall, the expected count of cyclists decreases by a factor of \(e^{-0.0209893} \approx 0.979\), or about \(2.2\%\) less.
Fitted lines appear almost identical to the Poisson model, however slightly less noise, and preserving the same trend.
The residuals have less of a trend and appear much smaller for the negative binomial model compared to the Poisson model. This is probably the final model of choice.
I did not include interactions, and used a first order model with all the covariates of concern. An extension of this investigation would be to explore interaction effects between certain covariates and their effect on cycle count.
There appears be zero-inflated rainfall measurements, although it would be tempting to use a zero-inflated Poisson, that is only if the dependent variable is zero-inflated. In this case we could fit a compound Poisson model and account for over-dispersion like in our negative binomial model of choice.
Although I included the ENSO index in the causal diagram and is a confounder for rainfall, I decided not to investigate this further in the model, as I believe the El Nino / El Nina effects are already explained by the year covariate.
Public holidays such as Easter, Christmas, and New Years to name a few, are an important covariate which has direct effects on the cycle count. This is a covariate I would model as an extension of this analysis, that may serve to decrease noise, decreasing residual of rainfall on cycle counts.
Rainfall is a good indicator of weather conditions favorable for cycling, however we should also consider temperate and wind speed in our model, that may serve to decrease the residual of rainfall on cycle counts. However I’m not well versed in climate science, and adjusting for these other covariates might introduce mediator in our model which could be a problem.
Based on my research, it appears that rainfall effects are more pronounced between weekdays and weekends in very large metropolitan areas. I chose not to include this covariate in the Causal DAG and model, as I suspect the weekend effect may be negligible in a smaller city like Auckland. Additionally, our rainfall data is not the most comprehensive, which might render this consideration less impactful. Even if there were a slight effect, it should be adequately accounted for by the day of week covariate.